Chromatin immunoprecipitation experiments followed by sequencing (ChIP–seq) detect protein–DNA binding events and chemical modifications of histone proteins (Figure 1).
Comparison of ChIP-seq experimental protocols (adapted from Furey 2012, Nature Reviews).
Profiles obtained from transcription factor (TF) ChIP-seq and histone ChIP-seq are different (Figure 2):
The transcription factor and histone modification landscape for inactive and active cis-regulatory elements (CRE; promoters and enhancers) and the corresponding read profiles (from Pundhir et al. 2015, Frontiers in Genetics).
When sequencing ChIP-seq samples, we usually need a control for the antibody inespecific binding, known as input. We can use as input bulk DNA or DNA immunoprecipitated with IgG.
You have to consider that when sequencing ChIP-seq, many of the parameters you can choose depend on the specifics of your experiment, your question and your budget! However, to provide you with some general guidelines, this are some general recommendations:
The datasets we are going to use for this workshop are from Hlady et al., 2019 Hepatology, in which they compare several assays in normal liver with cirrotic liver and hepatocellular carcinoma (HCC) samples from human donors.
All raw data from their assays can be accessed via their GEO ID: GSE112221. The Gene Expression Omnibus (GEO) is a public functional genomics data repository that accepts array- and sequence-based data. Each dataset is given a GEO ID that usually starts with GSE. Inside each dataset, you can find independent samples with identifiers prefixed GSM.
In GEO you can download already processed data, such as bed files. However, we here are interested in downloading the raw fastq files so we can do all the processing ourselves. Therefore, for each sample we are interested in, we need to find the SRA ID (SRX, you can find it in the sample page in GEO) and from there, gather the run ID (SRR) which will give us access to the raw data.
+ More.You can download the files directly from the SRA website by clicking on the run ID, but you can also do so from the terminal using the SRA toolkit tool fastq-dump.
With fastq-dump you will be able to download fastq files directly from SRA. Since these samples are paired-end, you need to include the argument --split-3 to send each pair to a different file. --gzip will compress the fastq output to fastq.gz.
# Download fastq files from SRA database
fastq-dump --split-3 --gzip -O your_folder/ SRR6880492
Here you have a complete list of the samples we are going to use in this workshop and their SRA run IDs:
| Sample names | SRA run ID | Sequenced reads | Description |
|---|---|---|---|
| NL1_h3k27ac | SRR6880492 | 19,808,136 | H3K27ac ChIP-seq in Normal Liver |
| Cirr1_h3k27ac | SRR6880493 | 26,149,171 | H3K27ac ChIP-seq in Cirrotic Liver (Patient 1) |
| HCC1_h3k27ac | SRR6880494 | 18,789,828 | H3K27ac ChIP-seq in Hepatocellular Carcinoma (Patient 1) |
| Cirr3_h3k27ac | SRR6880495 | 25,896,625 | H3K27ac ChIP-seq in Cirrotic Liver (Patient 3) |
| HCC3_h3k27ac | SRR6880496 | 21,954,923 | H3K27ac ChIP-seq in Hepatocellular Carcinoma (Patient 3) |
| NL1_input | SRR6880507 | 35,050,237 | Input ChIP-seq in Normal Liver |
| Cirr1_input | SRR6880509 | 24,301,177 | Input ChIP-seq in Cirrotic Liver (Patient 1) |
| HCC1_input | SRR6880511 | 27,006,532 | Input ChIP-seq in Hepatocellular Carcinoma (Patient 1) |
| Cirr3_input | SRR6880513 | 36,198,135 | Input ChIP-seq in Cirrotic Liver (Patient 3) |
| HCC3_input | SRR6880515 | 14,131,132 | Input ChIP-seq in Hepatocellular Carcinoma (Patient 3) |
A general workflow for ChIP-seq analysis consists on the following steps:
Before starting any processing steps, one should check the overall quality of the reads obtained from the sequencing facility. To do so, we are going to use FastQC, a quality control tool for high-troughput sequencing data.
fastqc folder_with_fastq/sample_reads.fastq
After checking that the overall quality is good, one can trim reads and/or filter out reads with low overall quality using different tools1.
fastqc for the control liver sample.
fastqc fastq/NL1_H3K27ac.fastq
We call alignment (or mapping) to the process of comparing reads to a reference genome, scoring the similarity and then assigning most likely genomic coordinates to each read.
A reference genome is a database of DNA sequences, assembled as a representative of a species genome. Usually, one can find two versions of the reference genome: NCBI (starts with GRC) and UCSC (hg for human, mm for mice). The difference between the two basically lies in the naming of the chromosomes: NCBI uses numbers (i.e. 1, 2, X) and UCSC numbers plus chr (i.e. chr1, chr2, chrX).
Nowadays, the most used versions of the human genome are the following:
There are tools to convert coordinates from one build to another, such as liftOver from UCSC.
Aligners are tools specifically designed to map reads to a reference genome using different algorithms. Many aligners are available, but the most used for aligning ChIP-seq data are:
Both tools allow alignment of single and paired end reads, you just have to specify it when running them. Generating an index is a techinque that many aligners use to speed up their running type. You may think about it as a book index: when having an idex it is then easier and faster to find specific parts of the book (i.e. the reference genome). Paper describing indexing methods
The output of the aligner is a SAM/BAM file containing the read sequence and the assigned coordinates respect to the reference genome. The aligner will report the number of aligned reads (which should be >80% of the sequenced reads) and which of those where uniquely mapped or multi-mapping (they match more than one region in the genome). Aligners such as bowtie2 report the best alginment for multi-mapping reads.
SAM/BAM filesSAM (Sequence Alignment/Map) and BAM (same as SAM but binary, compressed and indexed) are file types used for storing sequence and alignment information.
Can you tell which one of the folloing files is in SAM and which one in BAM format?
File 1:
�BC�W[�Y�gbb&J��“����έ��q�鮙4۷t�
��P�m&=�m�z���� >+���ºB\(���|}�U�}[��U�._4�����T����������������/�WN?p�Bem��l�n��g��`4�.�'�����z������ �t��� !�I�\)��’1�M#0��@
3�>�”�0��/æ�R��P�T�D ��L&���#�1BM������P2��p��`I$�I}�M�D^t�P�h2f�cFɠ�ό�
File 2:
@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
BAM/SAM files consist on two different parts:
@XX and contains information on the file itself: how it is ordered, which is the reference genome used, which program you used to align the reads, etc.You can find more information on all the fields present in the SAM/BAM files from the Samtools documentation.
Can you tell which lines are the header and which lines are the body of this SAM file?
@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
@HD and @SQ are the header.
HD) is telling you the format version (VN:1.6) and that the file is sorted by coordinates (SO:coordinate).SQ) contains information on the reference genome used for the alignment: the name of the reference genome (SN:ref) and it length (LN:45).Before running an alignment, you will have to download the reference genome you want to use and index it.
For downloading hg19 you can go to UCSC downloads > Human > hg19 > Full data set > hg19.2bit. 2bit is a compression that UCSC uses for making fasta files smaller. You can convert a 2bit file into a fasta file by running twoBitToFasta from UCSC tools.
twoBitToFastq hg19.2bit hg19.fa # Uncompress the file
bowtie2-build hg19.fa hg19 --threads 5 # Create index with Bowtie2
We got our index and we got our reads, so let’s run the alignment! You can specify different parameters for the alignment (check bowtie2 -h to see them all), we are going to use the following:
-t: Print the time it takes to perform each step.-x: Index filename prefix (without the trailing .X.bt2).-U: Single-end reads file (if it were paired end you need to specify the first pair file with -1 and the second pair with -2). Fastq files can be gzipped (extension: .gz), so no need to unzip before aligning!-S: Specify file for SAM output.bowtie2 -t -x index -U single_end_reads.fastq -S output.sam
Once your alignment is done, you usually will have to perform some routine steps to process the the output:
SAM file. You should convert it to BAM, which is a more light-weight way of storing the same data.BAM file by genomic coordinates.BAM file: duplicates, reads mapping to ENCODE black-listed regions, chromosomes you are not interested in, etc.BAM file. As with reference genomes, you can create a index for your BAM file so operations requiring accessing parts of it will run much faster.Usually, for running operations on BAM or SAM files, the preferred program is Samtools.
SAM to BAMThis first step can be easily done using samtools view, with the following arguments:
-@ 4: Indicates the number (4) of additional threads to use (default is 0).-b: Tells samtools to convert to BAM.-o: Output filename for the BAM file.samtools view file.sam -@ 4 -b -o file.raw.bam
BAM file by coordinateMost programs need an input BAM file in which reads are sorted according to their genomic position. To order our BAM file we can use samtools sort with the following parameters:
-o: Output filename for the sorted BAM file.-m 2G: Tells Samtools to use up to 2G per thread.-@ 4: Uses 4 __additional__threads.samtools sort file.raw.bam -o file.srtd.bam -m 2G -@ 4
Do you remember bash pipes? You can use them with samtools too! You can pipe the output of samtools view (a BAM file) to samtools sort adding - , without having to create any intermediate files:
samtools view data/ChIP-seq/BAM/file.sam -b | samtools sort - -o data/ChIP-seq/BAM/file.srtd.bam
This step is usually recommended to avoid PCR amplification bias and also to remove potential biases induced by PCR errors that could then be amplificated. This step can be performed by several tools, but for the sake of simplicity we are going to use samtools markdup, which will remove all reads marked as duplicates (when argument -r is used).
Briefly, samtools markdup checks wether multiple reads have the same coordinates and orientation. When a duplicate is detected, the overall highest quality template is kept and all others have the duplicate flag set (and removed if you use -r)2.
The arguments we will use when running samtools markdup are the following:
-r: Remove duplicate reads.-s: Report statistics.-@ 4: Use 4 additional threads.samtools markdup data/ChIP-seq/BAM/file.srtd.bam data/ChIP-seq/BAM/file.rmdup.bam -r -@ 4 -s
As with reference genomes, BAM files need to be indexed so when querying sequences contained in it, we can do it in less amount of time. To do so, we just need to use samtools index. This will generate an additional file with the same name as your input BAM file, but adding the suffix .bai.
samtools index file.rmdup.bam
The last step of the processing of ChIP-seq data is the peak calling. In this step, we will run a program that using different algorithms will detect the regions in which there is an enrichment of ChIP-seq reads compared to the background, and will return a list with the genomic positions of this regions (i.e. peaks).
One of the most famous and most used peak callers is MACS2. MACS captures the influence of genome complexity to evaluate the significance of enriched ChIP regions and improves the spatial resolution through combining the information of both sequencing tag position and orientation.
In this step, we will need to use the H3K27ac sample and its input, so we can normalize and remove the inespecific events. Therefore, we will run MACS2 with the following arguments:
-f: Indicates the format of the input file, in this case BAM.-t: Treatment file (in this case, H3K27ac).-c: Control or input file.-g: Effective genome size. hs indicates that we are using the human genome.-n: Name for the output files (without any suffixes).--tempdir: Directory for saving temporary files.--broad: Call broad peaks. Since we are analyzing histone marks, it will try to join consecutive peaks corresponding to nucleosomes sorrounding a CRE.--nomodel: Avoid estimation of fragment size.activate-macs-git-2017.5.15 # Activate MACS2
mkdir tmp/ # Create folder for temporary files
macs2 callpeak -f BAM -t file.rmdup.bam -c file_input.rmdup.bam -g hs -n file --tempdir tmp/ --broad --nomodel
An important step on the processing of ChIP-seq data, which can be done at any point is visualizing your data. You can visualize all files generated from the alignment forward using a genome browser. We will see some examples using IGV.
This are some example file types one can visualize in a genome browser:
BAM files. It is mandatory for the index to be present in the same folder as the BAM file. When you load a BAM file into IGV, you will see that it loads two different tracks: a track with all reads represented (missmatches between read and reference will be highlighted) and a coverage track, which is the sum of all reads aligning to a specific region.BED, broadPeak) will be represented as rectangles with their genomic coordinates.bigWig files. This type of file is approppriate when your interested in plotting the coverage of your track from a lightweight file, without any information regarding read sequence or quality. bigWig files can be generated from BAM files converting them to bedGraph (same as bigWig but without compression) and then to bigWig.The steps to generate a bigWig file are as follows:
BAM to bedGraph using genomeCoverageBed from bedtools.genomeCoverageBed -bg -scale NUM -ibam file.rmdup.bam -g chr_sizes > file.bdg
bedGraph to bigWig.bedGraphToBigWig file.bdg chr_sizes file.bw -unc